The purpose of this document is to detail my overall process for exploring a given Case Study, going from accessing the data to finding insights that answer the specified business task.
Goal: Investigating the differences between members and casual riders in their usage of Cyclistic bikes to better understand how to convert riders into members.
Data has been made available by Motivate International Inc. under this license. For the purpose of this business task, all of the 2022 data and the available 2023 data (up to Q3 currently, 20/04/2023) was used.
According to the license, this data is permitted to be used as ‘source material, as applicable, in analyses, reports, or studies published or distributed for non-commercial purposes’.
In this fictional scenario, the data has been collected by Cyclistic, so it will be assumed to be reliable, original and credible for this buisness task.
Security considerations aren’t so relevant as data is accessible online and free-to-use (but not monetize). Regardless, the data was stored on a single-user accessible and password-protected PC.
Initially several CSV files of the same schema were downloaded as ZIP files and extracted and stored in a single directory. Using the below python script, all of the CSV files were merged into a single one.
from os import listdir
from os.path import isfile, join
import pandas as pd
def files(directory):
onlyfiles = [f for f in listdir(directory) if isfile(join(directory, f))]
return onlyfiles
def merge_csv_files(loc):
for p in files(loc):
if '.csv' in p:
l = loc + "/" + p
df2 = pd.read_csv(l)
df = df2 if df is None else pd.concat([df, df2])
The schema of the merged CSV file
columns | types | n_missing | n_uniques | completion_rate |
|---|---|---|---|---|
ride_id | string | 0 | 6,307,141 | 1.0 |
rideable_type | string | 0 | 3 | 1.0 |
start_station_name | string | 921,168 | 1,705 | 0.9 |
start_station_id | string | 921,300 | 1,321 | 0.9 |
end_station_name | string | 985,758 | 1,728 | 0.8 |
end_station_id | string | 985,899 | 1,326 | 0.8 |
member_casual | string | 0 | 2 | 1.0 |
start_lat | numeric | 0 | 1.0 | |
start_lng | numeric | 0 | 1.0 | |
end_lat | numeric | 6,284 | 1.0 | |
end_lng | numeric | 6,284 | 1.0 | |
started_at | datetime | 0 | 1.0 | |
ended_at | datetime | 0 | 1.0 |
The data isn’t entirely complete nor consistent:
Missing start_station_id and end_station_id data
There more unique start_station_name than start_station_id (same for end_station_id)
data entry mistakes?
name changes over time?
do ids represent multiple stations? will location data support this assumption?
Missing end_lat and end_lng data
More data integrity issues were discovered by inspecting the data.
The field to field relationships were investigated for the purpose of normalisation. If any one-to-one relationships are found, the data can be split into multiple tables for more efficient storage (and better data modelling).
Discovered the following relationships:
(end_station_id) many<—->many (end_station_name)
(start_station_id) many<—->many (start_station_name)
end_station_id | end_station_name |
|---|---|
1032 | Public Rack - Kedvale Ave & 63rd St |
1032 | Public Rack - Kedvale Ave & 63rd St W |
1039 | Public Rack - Kedzie Ave & 61st Pl |
1039 | Public Rack - Kedzie Ave & 61st Pl W |
13053 | Green St & Washington Blvd |
13053 | Green St & Randolph St |
Summary:
In simpler terms for both the start and end stations, two different stations by id may have the same name and a single station by id can have multiple names.
Further investigation would need to be carried out to determine the cause of these inconsistencies. A prelim search revealed some potential causes: street name changes; spelling mistakes.
Attempts at normalisation via extraction of a station table from this data set aren’t feasible prior to this investigation.
For now, this report will assume end_station_id and start_station_id to be identifiers for their respective stations.
(end_station_id) many<—->many (end_lng, end_lat)
(start_station_id) many<—->many (start_lng, start_lat)
end_lat | end_lng | id_count |
|---|---|---|
41.66 | -87.55 | 2 |
41.68 | -87.68 | 2 |
41.68 | -87.64 | 2 |
41.68 | -87.63 | 3 |
41.68 | -87.62 | 3 |
41.69 | -87.71 | 4 |
start_station_id | sd_lat | sd_lng | range_distance_variation |
|---|---|---|---|
Hubbard Bike-checking (LBS-WH-TEST) | 0.0620 | 0.0436 | 45,093.84 |
LF-005 | 0.0020 | 0.0018 | 43,332.07 |
600 | 0.0203 | 0.0091 | 42,275.41 |
599 | 0.0298 | 0.0144 | 41,396.61 |
596 | 0.0240 | 0.0049 | 41,357.62 |
605 | 0.0128 | 0.0025 | 40,653.37 |
Summary:
A single coordinate (lat, lng) can have multiple station ids associated with it and vice versa.
Insertion or rounding errors don’t explain why many stations ids correspond to many different locations.
By applying Haversine distances between the minimum and maximum coordinates found for a given station ID, 565 start station ids had a distance of over 1km.
# Check the rideable_type categories
bike_categories <- unique(X2022_202303_divvy_tripdata$rideable_type)
print(bike_categories)
## [1] "electric_bike" "classic_bike" "docked_bike"
# Check the member_casual categories
member_categories <- unique(X2022_202303_divvy_tripdata$member_casual)
print(member_categories)
## [1] "casual" "member"
# Check the date range
date_range <- c(min(X2022_202303_divvy_tripdata$started_at), max(X2022_202303_divvy_tripdata$ended_at))
print(date_range)
## [1] "2022-01-01 00:00:05 UTC" "2023-04-03 11:41:11 UTC"
clean_data <- X2022_202303_divvy_tripdata %>% filter(!(is.na(start_station_id) & is.na(end_station_id)))
Using the datetime and latitude/longitude fields, a duration and distance could be calculated for each ride respectively.
# Used Haversine distance in library(geosphere)
example_clean_data <- clean_data %>% drop_na(start_lat, end_lng, end_lat, start_lng) %>% head()
h_distances <- apply(example_clean_data[c('start_lng', 'start_lat', 'end_lng', 'end_lat')], 1, function(x) distm(c(x[1], x[2]), c(x[3], x[4]), fun=distHaversine))
example_clean_data <- cbind(example_clean_data, h_distances)
# Calculating duration:
example_clean_data_with_metrics <- example_clean_data %>% mutate(ride_duration_mins=round(as.numeric(ended_at-started_at)/60, 3)) %>% rename(haversine_distance=h_distances)
flextable(example_clean_data_with_metrics %>% select(ride_id, ride_duration_mins, haversine_distance))
ride_id | ride_duration_mins | haversine_distance |
|---|---|---|
D789EFF6D43AB2A1 | 11.267 | 1,468.216 |
9B9FEDC78CCF828F | 3.417 | 0.000 |
241C440C74CB31BB | 9.067 | 1,280.955 |
02DA3A10370A3F18 | 13.233 | 4,892.140 |
76C81BC148654648 | 9.733 | 1,388.439 |
53A7590B28ED25E2 | 0.750 | 0.000 |
Insights can be found by grouping the rides by member or casuals and summarising based on a variety of different distance and duration and count aggregations for each group. Below show all the aggregations chosen:
basic_summary <- clean_data_with_metrics %>% group_by(member_casual) %>%
summarize(number_of_rides=n(),
mean_h_distance=mean(haversine_distance, na.rm=TRUE),
median_h_distance=median(haversine_distance, na.rm=TRUE),
total_h_distance=sum(haversine_distance, na.rm=TRUE),
std_h_distance=sd(haversine_distance, na.rm=TRUE),
mean_duration_mins=mean(ride_duration_mins, na.rm=TRUE),
median_duration_mins=median(ride_duration_mins, na.rm=TRUE),
total_duration_mins=sum(ride_duration_mins, na.rm=TRUE),
std_duration=sd(ride_duration_mins, na.rm=TRUE),
number_of_electric=sum(rideable_type=='electric_bike'),
number_of_classic=sum(rideable_type=='classic_bike'),
number_of_docked=sum(rideable_type=='docked_bike'), .groups = "drop")
counts_per_bike_type <- basic_summary %>% select(member_casual, number_of_electric, number_of_classic, number_of_docked, number_of_rides)
distances_summary <- basic_summary %>% select(member_casual, total_h_distance, median_h_distance, mean_h_distance, std_h_distance)
durations_summary <- basic_summary %>% select(member_casual, total_duration_mins, median_duration_mins, mean_duration_mins, std_duration)
bike_type_proprotions = counts_per_bike_type %>% mutate(electric=100*number_of_electric/number_of_rides,
classic=100*number_of_classic/number_of_rides,
docked=100*number_of_docked/number_of_rides) %>%
select(member_casual, electric, classic, docked) %>%
gather(key="rideable_type", value="percentage_of_rides", 2:4)
ggplot(data=bike_type_proprotions) +
geom_bar(mapping=aes(y=rideable_type, x=percentage_of_rides), stat='identity') +
facet_wrap(~member_casual) +
labs(title="Bike Type Choices",
y="Type of bike",
x="Percentage of rides")
durations_by_member <- durations_summary %>% mutate(total_duration_hours=as.numeric(total_duration_mins/60)) %>%
select(member_casual, median_duration_mins, total_duration_hours, mean_duration_mins, std_duration)
flextable(durations_by_member)
member_casual | median_duration_mins | total_duration_hours | mean_duration_mins | std_duration |
|---|---|---|---|---|
casual | 13.033 | 1,135,391.9 | 30.10545 | 283.31525 |
member | 8.683 | 746,042.4 | 12.51538 | 29.81876 |
distances_by_member <- distances_summary %>% mutate(total_distance_km=as.numeric(total_h_distance/1000),
mean_distance_km=as.numeric(mean_h_distance/1000),
std_distance_km=as.numeric(std_h_distance/1000),
median_distance_km=as.numeric(median_h_distance/1000)) %>%
select(member_casual, median_distance_km, total_distance_km, mean_distance_km, std_distance_km) %>%
mutate_if(is.numeric, ~round(., 1))
flextable(distances_by_member)
member_casual | median_distance_km | total_distance_km | mean_distance_km | std_distance_km |
|---|---|---|---|---|
casual | 1.7 | 4,932,813 | 2.2 | 9.5 |
member | 1.5 | 7,482,507 | 2.1 | 12.9 |
# Finding out where the stations are located
states <- map_data('state')
data_by_end_location <- data_by_end_location %>% filter(end_lng < 0)
ggplot() +
geom_polygon(data=states, mapping=aes(x=long, y=lat, group=group)) +
geom_text(data = states %>% group_by(region) %>% summarize(long=mean(long), lat=mean(lat)), mapping=aes(x=long, y=lat, group=region, label=region)) +
geom_point(data=data_by_end_location, mapping=aes(x = end_lng, y = end_lat), col = "red")
illinois <- states %>% filter(is.element(region, c('illinois')))
ggplot() +
geom_polygon(illinois, mapping=aes(x=long, y=lat, group=group)) +
geom_point(data=data_by_end_location, mapping=aes(x = end_lng, y = end_lat), col = "red")
The stations congregate in the North Eastern region of Illinois, specically within and around the Chicago area.
Loading a shapefile of Chicago grouped by community:
chicago_sf <- st_read(sf_loc)
## Reading layer `geo_export_ec6e177b-af8b-4325-b6b1-922000facde1' from data source `/Users/kimathinyota/CyclisticAnalysis/SHAPES/Chicago_community_areas/geo_export_ec6e177b-af8b-4325-b6b1-922000facde1.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 77 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS: WGS84(DD)
ggplot(data=chicago_sf) +
geom_sf()
Each location, start or end, were grouped by their given Chicago community using the Chicago shape file. A subset of this aggregated data by community is shown below.
community | total_members | total_casuals | total_classic | total_docked | total_riders | total_starts | total_ends | mid_latitude | mid_longitude |
|---|---|---|---|---|---|---|---|---|---|
ALBANY PARK | 14,783 | 9,749 | 7,800 | 723 | 24,532 | 5,834 | 13 | 41.96539 | -87.71158 |
ARCHER HEIGHTS | 1,303 | 1,495 | 0 | 0 | 2,798 | 96 | 5 | 41.80362 | -87.72482 |
ARMOUR SQUARE | 45,444 | 25,317 | 36,965 | 3,454 | 70,761 | 11,501 | 16 | 41.84575 | -87.63337 |
ASHBURN | 242 | 142 | 0 | 0 | 384 | 5 | 15 | 41.74600 | -87.70411 |
AUBURN GRESHAM | 1,182 | 1,426 | 824 | 61 | 2,608 | 483 | 19 | 41.74284 | -87.65254 |
AUSTIN | 5,373 | 9,052 | 3,090 | 387 | 14,425 | 2,307 | 39 | 41.88648 | -87.76170 |
chicago_sf_with_stats <- chicago_sf_with_stats %>% mutate(member_percentage=round(100*total_members/total_riders,3), total_riders_per_thousand=total_riders/1000)
ggplot(data=chicago_sf_with_stats) +
geom_sf(aes(fill=member_percentage)) +
geom_point(aes(x = mid_longitude, y = mid_latitude, size=total_riders_per_thousand), alpha=0.2) +
labs(title="Station locations in Chicago",
fill="Percentage of members (%)",
size="Total number of riders (in thousands)",
x="Longitude",
y="Latitude",
alpha=NA)